Recaptured Raw Screen Image and Video Demoiréing via Channel and Spatial Modulations
Capturing screen contents with smartphone cameras has become a common way of sharing information. However, these images and videos are often degraded by moiré patterns, which are caused by frequency aliasing between the camera filter array and digital display grids. We observe that moiré patterns in the raw domain are simpler than those in the sRGB domain, and that moiré patterns in different raw color channels have different properties. Therefore, we propose an image and video demoiréing network tailored for raw inputs. We introduce a color-separated feature branch, which is fused with the traditional feature-mixed branch via channel and spatial modulations. Specifically, the channel modulation utilizes modulated color-separated features to enhance the color-mixed features, while the spatial modulation utilizes features with a large receptive field to modulate features with a small receptive field. In addition, we build the first well-aligned raw video demoiréing (RawVDemoiré) dataset and propose an efficient temporal alignment method based on inserting alternating patterns. Experiments demonstrate that our method achieves state-of-the-art performance for both image and video demoiréing. We have released the code and dataset at https://github.com/tju-chengyijia/VD_raw
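To make the two modulations concrete, here is a minimal PyTorch sketch of the general idea. The module names, the squeeze-and-gate design of the channel branch, and the dilated convolution standing in for the large receptive field are illustrative assumptions, not the authors' released implementation (see the repository above for that).

```python
import torch
import torch.nn as nn

class ChannelModulation(nn.Module):
    """Scale color-mixed features with per-channel weights derived
    from color-separated features (sketch of channel modulation)."""
    def __init__(self, channels):
        super().__init__()
        self.to_scale = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),         # global channel statistics
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),                    # per-channel modulation weights
        )

    def forward(self, mixed, separated):
        return mixed * self.to_scale(separated)

class SpatialModulation(nn.Module):
    """Modulate a small-receptive-field feature with a spatial map
    predicted from a large-receptive-field view of the same feature."""
    def __init__(self, channels, dilation=4):
        super().__init__()
        # dilated conv stands in for the large receptive field
        self.large_rf = nn.Conv2d(channels, channels, 3,
                                  padding=dilation, dilation=dilation)
        self.to_map = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, small_rf_feat):
        return small_rf_feat * self.to_map(self.large_rf(small_rf_feat))
```

In both cases the richer signal is reduced to multiplicative weights rather than added, so it gates the other branch instead of overwriting it.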
Automatic gauge detection via geometric fitting for safety inspection
For safety reasons in electrical substations, inspection robots have recently been deployed to monitor important devices and instruments in high-voltage environments where the presence of skilled technicians would otherwise be required. The captured images are transmitted to a data station and are usually analyzed manually. Toward automatic analysis, a common task is to detect gauges in the captured images. This paper proposes a gauge detection algorithm based on geometric fitting. We first use Sobel filters to extract edges, which usually contain the shapes of gauges. Then, we use line fitting under the framework of random sample consensus (RANSAC) to remove straight lines that do not belong to gauges. Finally, RANSAC ellipse fitting is applied to find the best-fitting ellipse among the remaining edge points. Experimental results on a real-world dataset captured by GuoZi Robotics demonstrate that our algorithm provides more accurate gauge detection results than several existing methods.
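A rough sketch of this three-stage pipeline, using scikit-image's RANSAC utilities in place of custom fitting code. The input path, the thresholds, the Canny edge detector (the paper uses Sobel filters), and the single pass of line removal (a real pipeline would iterate until no dominant line remains) are all illustrative assumptions.

```python
import numpy as np
from skimage import io, color, feature, measure

# stage 1: edge extraction ("gauge.jpg" is a hypothetical input)
img = color.rgb2gray(io.imread("gauge.jpg"))
edges = feature.canny(img, sigma=2)               # edge map (paper: Sobel)
points = np.column_stack(np.nonzero(edges))[:, ::-1].astype(float)  # (x, y)

# stage 2: RANSAC line fitting, discard points on straight structures
_, line_inliers = measure.ransac(points, measure.LineModelND,
                                 min_samples=2, residual_threshold=1.0,
                                 max_trials=500)
points = points[~line_inliers]

# stage 3: RANSAC ellipse fitting on the remaining edge points
ellipse, _ = measure.ransac(points, measure.EllipseModel,
                            min_samples=5, residual_threshold=2.0,
                            max_trials=1000)
xc, yc, a, b, theta = ellipse.params              # fitted gauge ellipse
```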
HDR Video Reconstruction with a Large Dynamic Dataset in Raw and sRGB Domains
High dynamic range (HDR) video reconstruction is attracting increasing attention due to its superior visual quality compared with low dynamic range (LDR) video. The availability of LDR-HDR training pairs is essential for HDR reconstruction quality. However, there are still no real LDR-HDR pairs for dynamic scenes, due to the difficulty of capturing LDR and HDR frames simultaneously. In this work, we propose to utilize a staggered sensor to capture two alternate-exposure images simultaneously, which are then fused into an HDR frame in both the raw and sRGB domains. In this way, we build a large-scale LDR-HDR video dataset with 85 scenes, each containing 60 frames. Based on this dataset, we further propose Raw-HDRNet, which utilizes the raw LDR frames as inputs. We propose a pyramid flow-guided deformable convolution to align neighboring frames. Experimental results demonstrate that 1) the proposed dataset can improve the HDR reconstruction performance on real scenes for three benchmark networks; 2) compared with sRGB inputs, utilizing raw inputs can further improve the reconstruction quality, and our proposed Raw-HDRNet is a strong baseline for raw HDR reconstruction. Our dataset and code will be released after the acceptance of this paper.
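The alignment step can be sketched with torchvision's deformable convolution. Everything below (a single pyramid level, the offset head, treating the optical flow as a base offset that the network refines) is one hypothetical reading of "pyramid flow-guided deformable convolution", not the paper's code.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class FlowGuidedAlign(nn.Module):
    """One pyramid level of flow-guided deformable alignment (sketch):
    optical flow provides a coarse per-pixel offset, a small conv head
    predicts residual offsets, and a deformable convolution samples
    the neighboring frame's features at the refined positions."""
    def __init__(self, channels=64, groups=8):
        super().__init__()
        # 2 offsets (dy, dx) per sampling point of a 3x3 kernel, per group
        self.offset_conv = nn.Conv2d(channels * 2 + 2, 2 * groups * 9,
                                     3, padding=1)
        self.deform = DeformConv2d(channels, channels, 3, padding=1)

    def forward(self, cur, nbr, flow):
        # residual offsets predicted from both frames' features and the flow
        res = self.offset_conv(torch.cat([cur, nbr, flow], dim=1))
        # tile the flow so every kernel sample starts from the flow offset
        base = flow.flip(1).repeat(1, res.size(1) // 2, 1, 1)  # (dy, dx)
        return self.deform(nbr, res + base)   # nbr aligned to cur
```

In a full pyramid, this module would run from the coarsest to the finest level, with upsampled coarse offsets initializing the next level.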
rPPG-MAE: Self-supervised Pre-training with Masked Autoencoders for Remote Physiological Measurement
Remote photoplethysmography (rPPG) is an important technique for perceiving
human vital signs, which has received extensive attention. For a long time,
researchers have focused on supervised methods that rely on large amounts of
labeled data. These methods are limited by the requirement for large amounts of
data and the difficulty of acquiring ground truth physiological signals. To
address these issues, several self-supervised methods based on contrastive
learning have been proposed. However, they focus on contrastive learning
between samples, which neglects the inherent self-similar prior in physiological
signals and appears to have a limited ability to cope with noise. In this paper, a
linear self-supervised reconstruction task is designed to extract the
inherent self-similar prior in physiological signals. In addition, a specific
noise-insensitive strategy is explored to reduce the interference of motion
and illumination. The proposed framework, namely rPPG-MAE,
demonstrates excellent performance even on the challenging VIPL-HR dataset. We
also evaluate the proposed method on two public datasets, namely PURE and
UBFC-rPPG. The results show that our method not only outperforms existing
self-supervised methods but also exceeds the state-of-the-art (SOTA) supervised
methods. One important observation is that the quality of the dataset seems
more important than its size in self-supervised pre-training of rPPG. The
source code is released at https://github.com/linuxsino/rPPG-MAE
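A minimal sketch of the pre-training idea, assuming the common rPPG practice of feeding a spatio-temporal map (ROI x time) to a transformer. For brevity it zeroes masked patches and encodes all tokens (BERT-style) rather than using MAE's encode-visible-then-decode split, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class MaskedSTMapAE(nn.Module):
    """Masked-reconstruction sketch for rPPG pre-training: random
    patches of a spatio-temporal map are hidden and reconstructed,
    forcing the model to exploit the quasi-periodic self-similarity
    of the pulse signal."""
    def __init__(self, patch=8, dim=128, mask_ratio=0.75):
        super().__init__()
        self.patch, self.mask_ratio = patch, mask_ratio
        self.embed = nn.Linear(patch * patch * 3, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), 4)
        self.decoder = nn.Linear(dim, patch * patch * 3)

    def forward(self, stmap):                      # (B, 3, H, W) ST-map
        B, C, H, W = stmap.shape
        p = self.patch
        x = stmap.unfold(2, p, p).unfold(3, p, p)  # split into patches
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        mask = torch.rand(B, x.size(1), 1, device=x.device) < self.mask_ratio
        tokens = self.embed(x.masked_fill(mask, 0.0))   # hide masked patches
        recon = self.decoder(self.encoder(tokens))
        # reconstruction loss computed on the masked patches only
        return ((recon - x) ** 2)[mask.expand_as(x)].mean()
```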
Multi-scale Promoted Self-adjusting Correlation Learning for Facial Action Unit Detection
Facial Action Unit (AU) detection is a crucial task in affective computing
and social robotics as it helps to identify emotions expressed through facial
expressions. Anatomically, there are innumerable correlations between AUs,
which contain rich information and are vital for AU detection. Previous methods
used fixed AU correlations based on expert experience or statistical rules on
specific benchmarks, but it is challenging to comprehensively reflect complex
correlations between AUs via hand-crafted settings. There are alternative
methods that employ a fully connected graph to learn these dependencies
exhaustively. However, these approaches can result in a computational explosion
and a heavy dependency on large datasets. To address these challenges, this
paper proposes a novel self-adjusting AU-correlation learning (SACL) method
with less computation for AU detection. This method adaptively learns and
updates AU correlation graphs by efficiently leveraging the characteristics of
different levels of AU motion and emotion representation information extracted
in different stages of the network. Moreover, this paper explores the role of
multi-scale learning in correlation information extraction and designs a simple
yet effective multi-scale feature learning (MSFL) method to promote better
performance in AU detection. By integrating AU correlation information with
multi-scale features, the proposed method obtains a more robust feature
representation for the final AU detection. Extensive experiments show that the
proposed method outperforms the state-of-the-art methods on widely used AU
detection benchmark datasets, with only 28.7% and 12.0% of the parameters and
FLOPs of the best method, respectively. The code for this method is available
at https://github.com/linuxsino/Self-adjusting-AU.
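One way to picture "self-adjusting correlation learning" is a graph whose adjacency is recomputed from the current AU features and pruned to a few strongest edges, avoiding the fully connected graph criticized above. The sketch below illustrates that idea only; the top-k pruning, the single message-passing step, and all names are assumptions, not the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAdjustingGraph(nn.Module):
    """Sketch: build a sparse AU-correlation graph from pairwise
    feature similarity, keep only the k strongest edges per node,
    and propagate information between AU nodes once."""
    def __init__(self, dim, k=3):
        super().__init__()
        self.k = k
        self.proj = nn.Linear(dim, dim)

    def forward(self, au_feats):                  # (B, num_AUs, dim)
        sim = torch.einsum('bid,bjd->bij', au_feats, au_feats)
        topk = sim.topk(self.k, dim=-1).indices   # k strongest edges per AU
        mask = torch.zeros_like(sim).scatter_(-1, topk, 1.0)
        adj = F.softmax(sim.masked_fill(mask == 0, float('-inf')), dim=-1)
        return au_feats + adj @ self.proj(au_feats)   # one message pass
```

Because the adjacency is a function of the features, it updates as the features change across training stages, which is the "self-adjusting" behavior the abstract describes.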
Learning to reconstruct and understand indoor scenes from sparse views
This paper proposes a new method for simultaneous 3D reconstruction and semantic segmentation of indoor scenes. Unlike existing methods that require recording a video using a color camera and/or a depth camera, our method only needs a small number (e.g., 3 to 5) of color images from uncalibrated sparse views, which significantly simplifies data acquisition and broadens the applicable scenarios. To achieve promising 3D reconstruction from sparse views with limited overlap, our method first recovers the depth map and semantic information for each view, and then fuses the depth maps into a 3D scene. To this end, we design an iterative deep architecture, named IterNet, that estimates the depth map and semantic segmentation alternately. To obtain accurate alignment between views with limited overlap, we further propose a joint global and local registration method to reconstruct a 3D scene with semantic information. We also make available a new indoor synthetic dataset containing photorealistic high-resolution RGB images, accurate depth maps, and pixel-level semantic labels for thousands of complex layouts. Experimental results on public datasets and our dataset demonstrate that our method achieves more accurate depth estimation, smaller semantic segmentation errors, and better 3D reconstruction results than state-of-the-art methods.
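The alternating estimation can be pictured with a toy PyTorch sketch: each iteration re-predicts depth conditioned on the current segmentation and vice versa. The single-conv "backbone", the heads, and the fixed iteration count are placeholders for illustration, not the IterNet architecture itself.

```python
import torch
import torch.nn as nn

class IterDepthSeg(nn.Module):
    """Toy alternating scheme: depth is refined from features plus the
    current segmentation, then segmentation from features plus the
    current depth, for a fixed number of iterations."""
    def __init__(self, feat_ch=64, n_classes=13, iters=3):
        super().__init__()
        self.iters = iters
        self.backbone = nn.Conv2d(3, feat_ch, 3, padding=1)
        self.depth_head = nn.Conv2d(feat_ch + n_classes, 1, 3, padding=1)
        self.seg_head = nn.Conv2d(feat_ch + 1, n_classes, 3, padding=1)

    def forward(self, rgb):
        f = self.backbone(rgb)
        B, _, H, W = rgb.shape
        depth = torch.zeros(B, 1, H, W, device=rgb.device)
        seg = torch.zeros(B, self.seg_head.out_channels, H, W,
                          device=rgb.device)
        for _ in range(self.iters):                # alternate the two tasks
            depth = self.depth_head(torch.cat([f, seg], 1))
            seg = self.seg_head(torch.cat([f, depth], 1))
        return depth, seg
```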
Deep edge map guided depth super resolution
Accurate edge reconstruction is critical for depth map super resolution (SR). Therefore, many traditional SR methods utilize edge maps to guide depth SR. However, it is difficult to predict accurate edge maps from low resolution (LR) depth maps. In this paper, we propose a deep edge map guided depth SR method, which includes an edge prediction subnetwork and an SR subnetwork. The edge prediction subnetwork takes advantage of the hierarchical representation of color and depth images to produce accurate edge maps, which promote the performance of the SR subnetwork. The SR subnetwork is a disentangling cascaded network that progressively upsamples the SR result, where each level is made up of a weight-sharing module and an adaptive module. The weight-sharing module extracts features that are general across levels, while the adaptive module transfers the general features to specific features to adapt to differently degraded inputs. Quantitative and qualitative evaluations on various datasets with different magnification factors demonstrate the effectiveness and promising performance of the proposed method. In addition, we construct a benchmark dataset captured by a Kinect-v2 to facilitate research on real-world depth map SR.
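The weight-sharing/adaptive split can be sketched as follows: passing the same shared module instance to every level is what ties the general features together, while each level keeps its own adaptive convolution. All sizes and the x2-per-level PixelShuffle upsampling are illustrative assumptions, not the paper's architecture.

```python
import torch.nn as nn

class CascadeLevel(nn.Module):
    """One x2 level of a cascaded SR sketch: a module whose weights are
    shared across levels extracts general features, and a small
    per-level adaptive module specializes them before upsampling."""
    def __init__(self, shared, ch=32):
        super().__init__()
        self.shared = shared                       # same instance at every level
        self.adaptive = nn.Conv2d(ch, ch, 3, padding=1)
        self.up = nn.Sequential(nn.Conv2d(ch, ch * 4, 3, padding=1),
                                nn.PixelShuffle(2))

    def forward(self, x):
        return self.up(self.adaptive(self.shared(x)))

ch = 32
shared = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
levels = nn.Sequential(*[CascadeLevel(shared, ch) for _ in range(3)])  # x8
```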
Reference-Based Speech Enhancement via Feature Alignment and Fusion Network
Speech enhancement aims at recovering clean speech from a noisy input and can be classified into single speech enhancement and personalized speech enhancement. Personalized speech enhancement usually utilizes the speaker identity extracted from the noisy speech itself (or from a clean reference speech) as a global embedding to guide the enhancement process. In contrast, we observe that utterances from the same speaker are correlated at the level of frame-wise short-time Fourier transform (STFT) spectrograms. Therefore, we propose reference-based speech enhancement via a feature alignment and fusion network (FAF-Net). Given a noisy speech and a clean reference speech spoken by the same speaker, we first propose a feature-level alignment strategy to warp the clean reference to the noisy speech at the frame level. Then, we fuse the reference features with the noisy features via a similarity-based fusion strategy. Finally, the fused features are skip-connected to the decoder, which generates the enhanced result. Experimental results demonstrate that the performance of the proposed FAF-Net is close to state-of-the-art speech enhancement methods on both the DNS and Voice Bank+DEMAND datasets. Our code is available at https://github.com/HieDean/FAF-Net.
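The alignment-and-fusion step resembles cross-attention between per-frame features, which a short sketch can illustrate. The cosine similarity, the temperature, and concatenation as the fusion operator are assumptions about the general mechanism, not FAF-Net's exact design.

```python
import torch
import torch.nn.functional as F

def align_and_fuse(noisy_feat, ref_feat):
    """Sketch of frame-level alignment and similarity-based fusion.

    noisy_feat: (B, T_noisy, D) per-frame features of the noisy speech
    ref_feat:   (B, T_ref, D)   per-frame features of the clean reference
    """
    # cosine similarity between every noisy frame and reference frame
    sim = torch.einsum('bnd,bmd->bnm',
                       F.normalize(noisy_feat, dim=-1),
                       F.normalize(ref_feat, dim=-1))
    attn = F.softmax(sim / 0.1, dim=-1)        # temperature is illustrative
    aligned = attn @ ref_feat                  # reference soft-warped to noisy frames
    return torch.cat([noisy_feat, aligned], dim=-1)   # fused input for the decoder
```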